Improving Cache Utilization of Nested Parallel Programs by Almost Deterministic Work Stealing
Authors
Abstract
Nested (fork-join) parallelism eases parallel programming by enabling high-level expression of parallelism and leaving the mapping between tasks and hardware to the runtime scheduler. A challenge in dynamic scheduling of nested parallelism is how to exploit data locality, which has become more demanding with the deep cache hierarchies of modern processors with a large number of cores. This paper introduces almost deterministic work stealing (ADWS), which efficiently exploits data locality by deterministically planning a cache-hierarchy-aware schedule, while allowing a little variety in scheduling to facilitate load balancing. Furthermore, we propose an extension of our prior work on ADWS to achieve better shared cache utilization. The improved version of the scheduler is called multi-level ADWS. The idea is that only a part of the computation whose working set size is small enough to fit into a shared cache is scheduled within it recursively, thus avoiding excessive capacity misses. Our evaluation on benchmarks including decision tree construction demonstrated that multi-level ADWS outperformed the conventional random work stealing of Cilk Plus by 61%, and it showed a 40% performance improvement over the previous ADWS design.
Similar resources
Improving Cache Memory Utilization
In this paper, an efficient technique is proposed to manage the cache memory. The proposed technique introduces some modifications to the well-known set-associative mapping technique. This modification requires a small alteration in the structure of the cache memory and in the way it is referenced. The proposed alteration virtually increases the set size and consequently t...
A Work Stealing Scheduler for Parallel Loops on Shared Cache Multicores
Reordering instructions and data layout can bring significant performance improvement for memory bounded applications. Parallelizing such applications requires a careful design of the algorithm in order to keep the locality of the sequential execution. In this paper, we aim at finding a good parallelization of memory bounded applications on multicore that preserves the advantage of a shared cac...
Work stealing for GPU-accelerated parallel programs in a global address space framework
Task parallelism is an attractive approach to automatically load balance the computation in a parallel system and adapt to dynamism exhibited by parallel systems. Exploiting task parallelism through work stealing has been extensively studied in shared and distributed-memory contexts. In this paper, we study the design of a system that uses work stealing for dynamic load balancing of task-parall...
Improving Memory Utilization in Cache Coherence Directories
Efficiently maintaining cache coherence is a major problem in large-scale shared memory multiprocessors. Hardware directory coherence schemes have very high memory requirements, while software-directed schemes must rely on imprecise compile-time memory disambiguation. Recently proposed dynamically tagged directory schemes allocate pointers to blocks only as they are referenced, which significan...
Piecewise execution of nested data-parallel programs
The technique of flattening nested data parallelism combines all the independent operations in nested apply-to-all constructs and generates large amounts of potential parallelism for both regular and irregular expressions. However, the resulting data-parallel programs can have enormous memory requirements, limiting their utility. In this paper, we present piecewise execution, an automatic metho...
Journal
Journal title: IEEE Transactions on Parallel and Distributed Systems
Year: 2022
ISSN: 1045-9219, 1558-2183, 2161-9883
DOI: https://doi.org/10.1109/tpds.2022.3196192